========================================================

Introduction


Prosper, or Prosper Marketplace, is a leader in the online peer-to-peer lending industry. Borrowers create profiles and listings (loan requests) on Prosper.com and can apply for a fixed-rate, fixed-term loan between $2,000 and $40,000. Investors, either individuals or institutions, view a listing (the borrower’s loan request) and decide how much to lend the borrower towards the loan.

Interest rates are typically lower for the borrower than at a financial institution such as a bank. Multiple investors can contribute to one borrower’s loan request, which limits the risk to any one investor if the borrower defaults on the loan, while providing a higher yield.

As an amateur, I find the idea of peer-to-peer lending quite interesting from both a borrower’s and an investor’s standpoint. Through this analysis, I would like to determine which conditions (variables) might determine whether borrowers repay their loans.

Data Overview


The Prosper Loan dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
The dataset was last updated on 03/11/2014.

Structure of the dataset


The dimensions of the dataset and list of all the variables are as follows:

## [1] 113937     81
##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."              
## [16] "ProsperScore"                       
## [17] "ListingCategory..numeric."          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage." 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"

Feature Selection


As it is not efficient to use all 81 variables, I picked the following variables for my analysis:

  • Term — The length of the loan expressed in months.
  • ListingCategory — The category of the listing that the borrower selected when posting their listing:
    • 0 - Not Available
    • 1 - Debt Consolidation
    • 2 - Home Improvement
    • 3 - Business
    • 4 - Personal Loan
    • 5 - Student Use
    • 6 - Auto
    • 7 - Other
    • 8 - Baby&Adoption
    • 9 - Boat
    • 10 - Cosmetic Procedure
    • 11 - Engagement Ring
    • 12 - Green Loans
    • 13 - Household Expenses
    • 14 - Large Purchases
    • 15 - Medical/Dental
    • 16 - Motorcycle
    • 17 - RV
    • 18 - Taxes
    • 19 - Vacation
    • 20 - Wedding Loans
  • LoanStatus — The current status of the loan:
    • Cancelled
    • Chargedoff
    • Completed
    • Current
    • Defaulted
    • FinalPaymentInProgress
    • PastDue. The PastDue status will be accompanied by a delinquency bucket.
  • LoanOriginalAmount — The origination amount of the loan.
  • EmploymentStatus — The employment status of the borrower at the time they posted the listing.
  • EmploymentStatusDuration — The length in months of the employment status at the time the listing was created.
  • IncomeRange — The income range of the borrower at the time the listing was created.
  • IncomeVerifiable — The borrower indicated they have the required documentation to support their income.
  • CreditScoreRangeLower — The lower value representing the range of the borrower’s credit score as provided by a consumer credit rating agency.
  • CreditScoreRangeUpper — The upper value representing the range of the borrower’s credit score as provided by a consumer credit rating agency.
  • DebtToIncomeRatio — The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
  • BorrowerAPR — The Borrower’s Annual Percentage Rate (APR) for the loan.
  • IsBorrowerHomeowner — A Borrower will be classified as a homeowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
  • ProsperScore — A custom risk score built using historical Prosper data. The score is documented as ranging from 1-10, with 10 being the best, or lowest risk score, although values of 11 also appear in this dataset. Applicable for loans originated after July 2009.
  • ProsperRating..Alpha — The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.
  • BorrowerState — The two-letter abbreviation of the state of the borrower’s address at the time the listing was created.

Univariate Analysis


To get an idea of the individual variables and how each relates to the output variable, I will conduct a univariate analysis by plotting bar charts and histograms.
At the end of the univariate analysis, I will have a better idea of which variables to use in the bivariate and multivariate analysis.


I will subset a dataframe with only the selected variables.

## 'data.frame':    113937 obs. of  16 variables:
##  $ Term                     : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ EmploymentStatus         : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IncomeRange              : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable         : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerState            : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ CreditScoreRangeLower    : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper    : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ IsBorrowerHomeowner      : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ DebtToIncomeRatio        : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ ProsperScore             : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ProsperRating..Alpha.    : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...

LoanStatus Analysis

  • From this plot, it can be seen that the highest number of borrowers have Current loans, followed by Completed loans.
  • I aggregated all the past due categories into the Delinquent category.

Delinquent describes something or someone that fails to accomplish what is required by law or duty, such as the failure to make a required payment or perform a certain action. In a financial sense, delinquency occurs as soon as a borrower misses a payment.

A Defaulted loan is one where the borrower has failed to meet the legal obligations (or conditions) of the loan. It occurs when a borrower fails to repay the loan as specified in the original contract. Most creditors allow a loan to remain delinquent for a period of time before considering it in default; the duration depends on the creditor and loan type.

A Chargedoff loan is deemed unlikely to be collected by the creditor because the borrower has become substantially delinquent after a period of time. Traditionally, creditors will make this declaration at the point of six months without payment.


Create a new column ‘LoanStanding’ to categorize borrowers based on LoanStatus

  • I categorize the borrowers into ‘Good’ and ‘Bad’ standing in a new variable based on their loan status.
  • The borrowers with ‘Completed’, ‘Current’, ‘FinalPaymentInProgress’ and ‘Cancelled’ loans are categorized as Good.
  • I’m considering ‘Cancelled’ loans to be loan forgiveness, which falls into the ‘Good’ category.
  • The borrowers with ‘Delinquent’, ‘Defaulted’ and ‘Chargedoff’ loans are categorized as Bad.
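
The categorization rules above can be sketched in a few lines. This is a minimal Python illustration, not the code used for this report: the function names are hypothetical, and the assumption that every past-due status starts with "Past Due" or "PastDue" (for example "Past Due (1-15 days)") may not match the raw data exactly. FinalPaymentInProgress is treated as Good, as in the summary of derived variables later in this report.

```python
# Statuses treated as Good and Bad, following the rules in the text.
GOOD_STATUSES = {"Completed", "Current", "Cancelled", "FinalPaymentInProgress"}
BAD_STATUSES = {"Delinquent", "Defaulted", "Chargedoff"}


def normalize_status(status):
    """Collapse every past-due bucket into a single 'Delinquent' status.

    Assumption: past-due statuses start with 'Past Due' or 'PastDue',
    e.g. 'Past Due (1-15 days)'. The raw spelling may differ.
    """
    compact = status.replace(" ", "").lower()
    if compact.startswith("pastdue"):
        return "Delinquent"
    return status


def loan_standing(status):
    """Return 'Good', 'Bad', or None for a status the rules do not cover."""
    status = normalize_status(status)
    if status in GOOD_STATUSES:
        return "Good"
    if status in BAD_STATUSES:
        return "Bad"
    return None


print(loan_standing("Past Due (1-15 days)"))
print(loan_standing("Completed"))
```

Returning None for an unknown status, rather than guessing, keeps any unexpected category visible when the new column is tabulated.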

Loan Term Analysis

## loan$Term: 12
##  Bad Good 
##   92 1522 
## -------------------------------------------------------- 
## loan$Term: 36
##   Bad  Good 
## 17029 70749 
## -------------------------------------------------------- 
## loan$Term: 60
##   Bad  Good 
##  1956 22589

  • From the plot, it can be observed that the loan terms are divided into 12, 36 & 60 month periods.
  • The 36 month loan term is the most popular with 87778 borrowers opting for it.
  • The 12 month loan term is the least popular with only 1614 borrowers opting for it.
  • The proportion of borrowers with bad standing is 5.7%, 19.4% and 8.0% for the 12, 36 & 60 month term loans respectively.
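
The shares of Bad standing loans per term can be recomputed directly from the counts in the table above. The following Python sketch is only an illustration of that arithmetic; the variable and function names are hypothetical.

```python
# Bad/Good counts per loan term, copied from the table above.
counts_by_term = {
    12: {"Bad": 92, "Good": 1522},
    36: {"Bad": 17029, "Good": 70749},
    60: {"Bad": 1956, "Good": 22589},
}


def bad_share(counts):
    """Percentage of loans in Bad standing, rounded to one decimal place."""
    total = counts["Bad"] + counts["Good"]
    return round(100 * counts["Bad"] / total, 1)


bad_share_by_term = {term: bad_share(c) for term, c in counts_by_term.items()}
total_loans = sum(c["Bad"] + c["Good"] for c in counts_by_term.values())

print(bad_share_by_term)
print(total_loans)
```

The total across the three terms equals the 113,937 loans in the dataset, so every loan falls into either Good or Bad standing.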

ListingCategory Analysis

  • Listing category represents the reason why a borrower is requesting a loan.
  • The above plot shows that the majority of loans are under the category ‘Debt Consolidation’.
  • One thing to remember is that ‘Debt Consolidation’ is a vague term, as the debt could be due to any of the other listing categories. For example debt due to a car purchase, wedding purchases, etc. Many borrowers may have debt from multiple sources.
  • I scaled the y axis with a log10 transformation to view the data distribution better.
  • The scaled plot shows that the ‘Boat’ and ‘RV’ listing categories have the lowest proportion of borrowers with Bad standing loans.

LoanOriginalAmount Analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

  • The summary shows a range of $1000 to $35000.
  • The mean ($8337) and median ($6500) are considerably apart, which is visible in the plot.
  • There are significant spikes at around $4000, $10000, $15000, $20000 and $25000.
  • The spikes at higher loan amounts pull the mean above the median.

## loan$LoanStanding: Bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    4500    6624    8800   35000 
## -------------------------------------------------------- 
## loan$LoanStanding: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7000    8682   12500   35000

  • Focusing the plot on 90% of the data, it can be seen that most loan amounts are round numbers.
  • For example, there are spikes at about $500 intervals, with significant spikes at $1000 intervals.
  • This makes sense, as most borrowers would quote numbers rounded to the nearest $500 or $1000 rather than the exact required amount. This also makes it easy for the investors.
  • From the plot, the proportion of Bad loan standing borrowers seems to decrease as the loan amount increases. This may be because higher loan amounts are approved for borrowers with a good history, such as a high credit score. This relationship can be investigated later.

EmploymentStatus Analysis

  • The categorization in this variable is vague. For example, ‘Part-time’, ‘Full-Time’ and ‘Self-employed’ all come under ‘Employed’.
  • There is a category for borrowers who have not given employment details.
  • A markedly high number of borrowers are employed.
  • Among employed borrowers, significantly more have good loan standing than bad.

Employment Duration Analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   26.00   67.00   95.51  137.00  745.00
  • I set the EmploymentStatusDuration of all ‘Not employed’ borrowers to 0 and removed NAs. As a result, the data shows employment duration rather than employment status duration.
  • The first plot shows that the data is right skewed.
  • Focusing on the majority of the data in the second plot, borrowers with shorter employment duration are a comparatively higher loan risk.
  • After a log10 transformation in the third plot, the data still looks skewed. Moving forward, I will not use this transformation as it does not give any new insights.
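
The cleaning step described in the first bullet can be sketched as follows. This is an illustrative Python version under stated assumptions, not the code behind this report: the function name is hypothetical, records are (status, duration) pairs, and a missing duration (NA) is represented by None.

```python
def clean_employment_duration(records):
    """Set duration to 0 for 'Not employed' borrowers, then drop missing durations.

    `records` is a list of (employment_status, duration_in_months) pairs;
    None stands in for NA.
    """
    cleaned = []
    for status, duration in records:
        if status == "Not employed":
            duration = 0  # no current employment, so duration is taken as 0
        if duration is None:
            continue  # drop rows whose duration is still missing
        cleaned.append((status, duration))
    return cleaned


# Toy rows for illustration only.
sample = [
    ("Employed", 44),
    ("Not employed", None),
    ("Full-time", None),
    ("Not employed", 12),
]
print(clean_employment_duration(sample))
```

The order of the two steps matters: setting the value to 0 first keeps ‘Not employed’ borrowers with a missing duration in the data instead of dropping them.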

IncomeRange Analysis

  • The categories in the variable were tidied: ‘Not employed’ is treated as ‘$0’, and the factor levels are ordered.
  • Looking at the plot, the highest number of borrowers have an income range of ‘$25,000-49,999’, followed by ‘$50,000-74,999’.
  • The proportion of borrowers with bad loan standing is highest in the ‘$1-24,999’ range.
  • The number of borrowers in the ‘$100,000+’ income range is relatively high, which is an interesting find.
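
One way to order the income ranges is to sort the labels by their lower bound. The sketch below is a Python illustration with a hypothetical helper name; it assumes labels have the form shown in this report (such as ‘$25,000-49,999’ or ‘$100,000+’) and treats ‘Not employed’ as $0, as described above.

```python
import re


def income_lower_bound(label):
    """Lower bound in dollars of an income range label such as '$25,000-49,999'."""
    if label == "Not employed":
        return 0  # treated as '$0', as described in the text
    match = re.match(r"\$([\d,]+)", label)
    if match is None:
        raise ValueError(f"unrecognized income range: {label!r}")
    return int(match.group(1).replace(",", ""))


labels = [
    "$100,000+",
    "$1-24,999",
    "$50,000-74,999",
    "$0",
    "$75,000-99,999",
    "$25,000-49,999",
]
ordered = sorted(labels, key=income_lower_bound)
print(ordered)
```

Sorting on the parsed lower bound avoids the alphabetical ordering that would otherwise place ‘$100,000+’ before ‘$25,000-49,999’.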

Create a new column ‘CreditScore’

  • Check the difference in CreditScoreRangeUpper and CreditScoreRangeLower for all the borrowers to look for any patterns.
## 
##     19 
## 113346
  • It can be seen that the difference between CreditScoreRangeUpper and CreditScoreRangeLower is constant at 19.
  • The remaining 591 values are missing (NA).
  • Based on the above observation, adding 9.5 to CreditScoreRangeLower gives the midpoint of the range, which I use as the borrower’s CreditScore.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591
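
The derivation of CreditScore can be written out as a short sketch. This is a Python illustration with a hypothetical function name; missing values (NA) are represented by None.

```python
def credit_score(lower, upper):
    """Midpoint of the credit score range, or None if either bound is missing."""
    if lower is None or upper is None:
        return None
    return (lower + upper) / 2


# Because the range width is always 19, the midpoint equals lower + 9.5.
print(credit_score(640, 659))
print(credit_score(0, 19))
print(credit_score(None, None))
```

A lower bound of 0 gives 9.5, which matches the minimum in the summary above.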

CreditScore Analysis

  • As there were outliers, I trimmed the lowest and highest 1% of the data.
  • The distribution is almost normal. There is a difference of 5.6 between the mean and the median.
  • Since the mean is greater than the median, it is marginally right skewed. This does not warrant a log transformation.
  • An interesting find is that consecutive CreditScore values differ by 20.
  • Looking at the plot, most borrowers seem to have a CreditScore between 600 and 800.
  • The proportion of Bad loan standing borrowers decreases with higher CreditScore.
  • The proportion of Bad loan standing borrowers with CreditScore below 600 is very high, over 50%.
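
Trimming the lowest and highest 1% of the data can be done in several ways. The Python sketch below shows one simple option, dropping a fixed fraction of the sorted values from each end. The function name is hypothetical, and the plots in this report may have been trimmed differently (for example with quantile-based axis limits).

```python
def trim_tails(values, fraction=0.01):
    """Return the sorted values with `fraction` of them removed from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values to drop per tail
    return ordered[k:len(ordered) - k]


# Toy data: 200 values, so 2 are dropped from each end.
trimmed = trim_tails(list(range(200)))
print(len(trimmed), trimmed[0], trimmed[-1])
```

For small inputs the number dropped per tail rounds down to zero, so the values come back sorted but otherwise unchanged.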

DebtToIncomeRatio Analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

  • Outliers are present; they were removed by considering only 99% of the data.
  • The data is right skewed, but not enough to warrant a log transformation.
  • The LoanStanding for borrowers with a higher DebtToIncomeRatio seems to be Good. This suggests that borrowers with higher income may be able to handle more debt and still complete their loan payments.

BorrowerAPR Analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25

  • The data is right skewed; the mean is greater than the median.
  • The mode appears to be around 0.35, which is significantly greater than the mean and median.
  • Borrowers with bad LoanStanding have a higher APR, as expected.
  • The right skew of the plot suggests that the relation between BorrowerAPR and other variables like CreditScore and DebtToIncomeRatio warrants a deeper look.

IsBorrowerHomeowner Analysis

  • There seems to be little difference between the number of borrowers who own a home and the number who do not.
  • The proportion of borrowers with Bad LoanStanding seems to be higher in borrowers who do not own a home.

Prosper Score Analysis

  • Prosper Score for borrowers was introduced after July 2009.
  • ProsperScore data is NA for all borrowers prior to July 2009. By removing NA’s, I’ll be filtering data prior to July 2009.

  • Looking at the plot, it is evident that most of the borrowers have a ProsperScore from 4 to 8.
  • The highest number of borrowers have a ProsperScore of 4, followed by 6 & 7.

  • The proportion of Bad LoanStanding borrowers is higher at lower Prosper Score as expected.

Prosper Rating Analysis

  • Similar to Prosper Score, Prosper Rating is also only assigned to loans after July 2009.
  • A Rating of ‘AA’ is the best while a rating of ‘HR’ is the worst.

  • The highest number of loans have a rating of ‘C’.
  • As expected, most of the data falls into the ‘B’, ‘C’ and ‘D’ categories, which are medium ratings.

  • The loans with a low Prosper Rating of ‘HR’, ‘E’ & ‘D’ have a comparatively higher share of Bad LoanStanding borrowers.
  • In absolute numbers, loans with rating ‘D’ have the highest number of Bad LoanStanding borrowers.

Create a new column ‘ProsperRiskRating’ to categorize based on ProsperScore and ProsperRating

  • Created a new column ‘ProsperRiskRating’, categorizing borrowers after July 2009 into ‘High’, ‘Medium’ and ‘Low’ risk.
  • Borrowers with Prosper Score of 1,2,3,4 & Prosper Rating of ‘HR’, ‘E’, ‘D’ categorized into “High”.
  • Borrowers with Prosper Score of 5,6,7 & Prosper Rating of ‘C’, ‘B’ categorized into “Medium”.
  • Borrowers with Prosper Score of 8,9,10,11 & Prosper Rating of ‘A’, ‘AA’ categorized into “Low”.
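
The three rules above translate directly into a lookup. The following Python sketch is illustrative only: the names are hypothetical, and returning None for combinations the rules do not cover (for example a ProsperScore of 3 with a rating of ‘C’) is an assumption, since the text does not say how such loans are handled.

```python
# (allowed ProsperScore values, allowed ProsperRating values) per risk level.
RISK_RULES = {
    "High": ({1, 2, 3, 4}, {"HR", "E", "D"}),
    "Medium": ({5, 6, 7}, {"C", "B"}),
    "Low": ({8, 9, 10, 11}, {"A", "AA"}),
}


def prosper_risk_rating(score, rating):
    """Return 'High', 'Medium' or 'Low', or None if no rule matches.

    Assumption: loans whose score and rating fall in different bands
    are left uncategorized.
    """
    for label, (scores, ratings) in RISK_RULES.items():
        if score in scores and rating in ratings:
            return label
    return None


print(prosper_risk_rating(2, "HR"))
print(prosper_risk_rating(6, "B"))
print(prosper_risk_rating(11, "AA"))
print(prosper_risk_rating(3, "C"))
```

Loans from before July 2009 have no ProsperScore, so passing None as the score also yields None.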

ProsperRiskRating Analysis

  • Plot shows that the number of borrowers with Medium Prosper Risk Rating is highest, followed by High Prosper Risk Rating.

  • The proportion of Bad LoanStanding borrowers is lowest among Low Prosper Risk Rating borrowers (4%).
  • The proportion of Bad LoanStanding borrowers is highest in High Prosper Risk Rating borrowers (15%).

Borrower State Analysis

  • Create a new column with full State names to match the abbreviated names in BorrowerState.
  • Group the data by State.

## grp_by_state$BorrowerStateName
##        alabama         alaska        arizona       arkansas     california 
##           1679            200           1901            855          14717 
##       colorado    connecticut       delaware        florida        georgia 
##           2210           1627            300           6720           5008 
##         hawaii          idaho       illinois        indiana           iowa 
##            409            599           5921           2078            186 
##         kansas       kentucky      louisiana          maine       maryland 
##           1062            983            954            101           2821 
##  massachusetts       michigan      minnesota    mississippi       missouri 
##           2242           3593           2318            787           2615 
##        montana       nebraska         nevada  new hampshire     new jersey 
##            330            674           1090            551           3097 
##     new mexico       new york north carolina   north dakota           ohio 
##            472           6729           3084             52           4197 
##       oklahoma         oregon   pennsylvania   rhode island south carolina 
##            971           1817           2972            435           1122 
##   south dakota      tennessee          texas           utah        vermont 
##            189           1737           6842            877            207 
##       virginia     washington  west virginia      wisconsin        wyoming 
##           3278           3048            391           1842            150
  • From the map and the table, it can be seen that the largest number of borrowers are from California, followed by Texas, New York & Florida.
  • This is plausible, as Prosper started in California. California is also a highly populated state.
  • The states with the fewest borrowers are North Dakota (52), Maine (101), Wyoming (150), Iowa (186) and South Dakota (189).

## grp_by_state_LoanStanding_wide$BorrowerStateName
##        alabama         alaska        arizona       arkansas     california 
##          23.47          10.00          18.73          11.11          17.08 
##       colorado    connecticut       delaware        florida        georgia 
##          12.13          11.00          12.00          16.80          22.50 
##         hawaii          idaho       illinois        indiana           iowa 
##          11.98          22.04          18.78          14.63          36.02 
##         kansas       kentucky      louisiana          maine       maryland 
##          14.60          14.14          15.41          36.63          17.30 
##  massachusetts       michigan      minnesota    mississippi       missouri 
##          12.44          17.90          16.87          14.74          23.67 
##        montana       nebraska         nevada  new hampshire     new jersey 
##          16.97           9.50          12.11          15.97          12.37 
##     new mexico       new york north carolina   north dakota           ohio 
##          17.80          10.86          17.51          28.85          15.58 
##       oklahoma         oregon   pennsylvania   rhode island south carolina 
##          15.14          19.43          12.21          13.79          12.21 
##   south dakota      tennessee          texas           utah        vermont 
##          15.87          14.68          15.04          20.41          10.63 
##       virginia     washington  west virginia      wisconsin        wyoming 
##          12.48          18.11          15.35          13.90          10.00
  • The state with the highest proportion of Bad LoanStanding borrowers is Maine followed by Iowa and North Dakota.
  • The lowest proportion of Bad LoanStanding borrowers is in Nebraska, followed by Alaska & Wyoming.
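
A per-state percentage like the one in the table above can be computed by counting loans and Bad standing loans per group. The Python sketch below illustrates the idea; the function name is hypothetical and the rows are toy data for illustration only, not values from the Prosper dataset.

```python
from collections import Counter


def bad_share_by_group(records):
    """Percentage of Bad standing loans per group, rounded to two decimals.

    `records` is an iterable of (group, standing) pairs, where standing
    is 'Good' or 'Bad'.
    """
    totals = Counter()
    bad = Counter()
    for group, standing in records:
        totals[group] += 1
        if standing == "Bad":
            bad[group] += 1
    return {group: round(100 * bad[group] / totals[group], 2) for group in totals}


# Toy rows for illustration only.
toy_records = [
    ("maine", "Bad"), ("maine", "Good"), ("maine", "Good"),
    ("texas", "Good"), ("texas", "Good"), ("texas", "Bad"), ("texas", "Good"),
]
shares = bad_share_by_group(toy_records)
ranked = sorted(shares, key=shares.get, reverse=True)
print(shares)
print(ranked)
```

Percentages for states with few loans, such as North Dakota with 52, rest on small counts and can shift noticeably with a handful of loans.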

## grp_by_state_ProsperRisk_wide$BorrowerStateName
##        alabama         alaska        arizona       arkansas     california 
##          53.41          48.50          47.02          55.53          44.69 
##       colorado    connecticut       delaware        florida        georgia 
##          45.58          43.70          38.95          48.87          46.07 
##         hawaii          idaho       illinois        indiana         kansas 
##          47.37          54.34          42.66          48.37          50.35 
##       kentucky      louisiana       maryland  massachusetts       michigan 
##          51.80          49.94          48.24          43.47          50.99 
##      minnesota    mississippi       missouri        montana       nebraska 
##          46.36          54.96          51.65          47.96          51.71 
##         nevada  new hampshire     new jersey     new mexico       new york 
##          52.73          47.86          46.90          48.34          46.37 
## north carolina           ohio       oklahoma         oregon   pennsylvania 
##          49.57          49.45          47.68          49.18          50.60 
##   rhode island south carolina   south dakota      tennessee          texas 
##          45.23          48.19          54.50          50.65          46.27 
##           utah        vermont       virginia     washington  west virginia 
##          48.66          47.95          45.88          47.14          48.39 
##      wisconsin        wyoming 
##          46.22          55.28

## grp_by_state_ProsperRisk_wide$BorrowerStateName
##        alabama         alaska        arizona       arkansas     california 
##          13.06          17.37          16.80          11.31          20.91 
##       colorado    connecticut       delaware        florida        georgia 
##          18.91          18.57          20.22          16.76          19.42 
##         hawaii          idaho       illinois        indiana         kansas 
##          15.79          14.64          20.57          17.43          14.99 
##       kentucky      louisiana       maryland  massachusetts       michigan 
##          13.96          14.27          16.51          20.15          14.42 
##      minnesota    mississippi       missouri        montana       nebraska 
##          18.13          13.33          17.40          19.46          16.22 
##         nevada  new hampshire     new jersey     new mexico       new york 
##          14.16          15.58          17.87          16.31          19.20 
## north carolina           ohio       oklahoma         oregon   pennsylvania 
##          18.37          14.50          18.99          19.44          15.57 
##   rhode island south carolina   south dakota      tennessee          texas 
##          17.36          18.41           8.99          14.95          17.87 
##           utah        vermont       virginia     washington  west virginia 
##          18.77          16.96          19.65          19.65          18.06 
##      wisconsin        wyoming 
##          18.61          17.89
  • Iowa, Maine & North Dakota do not appear in these tables: they have no borrowers with a ProsperRiskRating, meaning no Prosper loans originated there after July 2009.
  • From the first table, the proportion of loans with High ProsperRisk is highest in Arkansas, followed by Wyoming & Mississippi.
  • From the second table, the proportion of loans with Low ProsperRisk is highest in California, followed by Illinois, Delaware & Massachusetts.

Univariate Analysis Summary


Did you create any new variables from existing variables in the dataset?

In order to facilitate my analysis of the Prosper loans data, I created new variables from existing variables.

  • ListingCategoryName — It contains the Listing Category name corresponding to the number in the ListingCategory..numeric. column.

  • BorrowerStateName — This variable contains the state names corresponding to the abbreviation in the BorrowerState column.

  • CreditScore — This variable contains the average CreditScore of a borrower. It is calculated by taking the average of the CreditScoreRangeLower and corresponding CreditScoreRangeUpper.

  • LoanStanding — This variable is derived from the LoanStatus variable. Categorizes the loan risk attributed to the borrowers based on their loan status.
    • Bad - Borrowers with loan status: PastDue, Defaulted & Chargedoff.
    • Good - Borrowers with loan status: Current, Completed, Cancelled & FinalPaymentInProgress.
  • ProsperRiskRating — This variable is derived from ProsperScore and ProsperRating..Alpha.. Categorizes the risk based on the Prosper Score and the Prosper Rating of the loan. Applicable for loans originated after July 2009.
    • High - Loans with ProsperScore “1,2,3,4” and ProsperRating “HR”,“E”,“D” (Worst possible combination)
    • Medium - Loans with ProsperScore “5,6,7” and ProsperRating “B”,“C”
    • Low - Loans with ProsperScore “8,9,10,11” and ProsperRating “A”,“AA” (Best possible combination)
    • One thing to note is that Prosper assigns more loans with a lower ProsperRating than ‘A’,‘AA’, because they would rather be wrong about a bad loan actually turning out good than a good loan turning out bad.

What are the main features of interest in your dataset?

  • The main features of interest in my dataset are LoanStanding and ProsperRiskRating.
  • ProsperRiskRating is derived from Prosper Score and Prosper Rating.
  • The Prosper score estimates the probability of a loan going “bad,” where “bad” means going 60+ days past due within the first twelve months from the date of loan origination.
  • Prosper score is a variable that Prosper assigns to loans to determine the risk they pose to investors. This score takes into account a number of variables such as debt to income ratio, loan payment performance on prior loans, credit card utilization, number of delinquent accounts and others.
  • Prosper rating is based on the Prosper score and credit score.

Supporting features to help investigate the features of interest?

  • IsBorrowerHomeowner — I want to see which category of borrowers (owning or not owning a home) is associated with good and bad Loan Standing. If a borrower owns a home, that means a bank has already approved them for a loan. This would be a good sign to indicate trust.
  • BorrowerAPR
  • CreditScore
  • DebtToIncomeRatio
  • EmploymentStatusDuration
  • EmploymentStatus — The categories in this variable seem rather vague and are probably not useful for predicting loan standing.
  • IncomeRange — Investigate if higher incomes are associated with greater loan payback success.
  • LoanOriginalAmount

Unusual Observations & operations to tidy or adjust the data.

  1. The most unusual findings from the analysis so far are —
  • There are some borrowers with an employment duration of more than 60 years.
  • The number of borrowers with an income of “$100,000+” is very high (more than “$75,000-99,999”). This could be because a larger number of users with high income consider a peer-to-peer loan.
  • There are borrowers with a CreditScoreRangeLower of 0 (a CreditScore of 9.5). One of the reasons could be that they are young borrowers with no previous credit history.
  • DebtToIncomeRatio data shows a right skewed distribution. One reason could be that borrowers with higher income are able to handle more debt. Credit score histogram should overlap.
  • The BorrowerAPR data also shows a right skewed distribution. Based on previous observations and assumptions, I expected this plot to be left skewed, i.e., the borrowers with high credit scores should qualify for lower APR.
  • The states of Maine, Iowa & North Dakota do not have any borrowers post July 2009.
  2. The operations I did to tidy or adjust data are —
  • Converted the Term variable to categorical with “12”, “36”, “60” categories.
  • Combined all the Past due categories to “Delinquent” in LoanStatus variable.
  • Created a new column, Loan Standing based on the Loan Status of borrowers, categorizing into Good and Bad Standing.
  • Created a new column, Listing Category Name containing the name of the listing category corresponding to its number.
  • I set the EmploymentStatusDuration of all Not employed borrowers to 0 and removed NAs. Due to this, the data shows employment duration rather than employment status duration.
  • In the IncomeRange variable, I set the income of all Not Employed to $0. Ordered the categories in ascending order.
  • Considered only Verifiable Income in the Income Range Analysis.
  • Created a new Variable for CreditScore containing the average of CreditScoreRangeLower and CreditScoreRangeUpper.
  • Converted the data in Prosper Score variable to factor. Ordered the factors in an ascending fashion.
  • Converted the Prosper Rating variable to factor and ordered it from worst to best.
  • Created a new variable, ProsperRiskRating, in which I categorized borrowers based on the best, average and worst combinations of ProsperScore and ProsperRating.
  • Created a new variable, BorrowerStateName which contains the full name of the borrower’s state. It is useful to group and map the data.
  • Filtered out the NA’s in all variables when plotting.
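
A few of the tidying steps listed above can be sketched as follows. This is a minimal illustration on an invented four-row data frame, not the original analysis code: the column names follow the dataset, but the values, the helper vectors `bad_statuses` and `good_statuses`, and the exact status-to-standing mapping are assumptions.

```r
# Toy stand-in for the Prosper data (values invented for illustration).
loan <- data.frame(
  Term = c(12, 36, 60, 36),
  LoanStatus = c("Current", "Past Due (1-15 days)", "Chargedoff", "Completed"),
  CreditScoreRangeLower = c(640, 700, 600, NA),
  CreditScoreRangeUpper = c(659, 719, 619, NA),
  stringsAsFactors = FALSE
)

# Term as a categorical variable with the three loan lengths.
loan$Term <- factor(loan$Term, levels = c(12, 36, 60))

# Collapse every "Past Due (...)" bucket into a single "Delinquent" status.
loan$LoanStatus <- ifelse(grepl("^Past Due", loan$LoanStatus),
                          "Delinquent", loan$LoanStatus)

# Assumed mapping: delinquent, defaulted and charged-off loans are "Bad";
# current and completed loans are "Good"; anything else is left as NA.
bad_statuses  <- c("Delinquent", "Defaulted", "Chargedoff")
good_statuses <- c("Current", "Completed")
loan$LoanStanding <- factor(
  ifelse(loan$LoanStatus %in% bad_statuses, "Bad",
         ifelse(loan$LoanStatus %in% good_statuses, "Good", NA_character_)),
  levels = c("Bad", "Good")
)

# CreditScore as the average of the reported lower and upper bounds.
loan$CreditScore <- (loan$CreditScoreRangeLower + loan$CreditScoreRangeUpper) / 2
```

Averaging a 20-point range such as 640 to 659 gives 649.5, which is why the CreditScore summaries in this report end in .5.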

Bivariate Analysis


Plot Matrix of the non-categorical variables and LoanRisk

  • There does not appear to be strong linear correlations between the variables.
  • The relationship between “CreditScore - BorrowerAPR”, “CreditScore - LoanOriginalAmount” & “CreditScore - DebtToIncomeRatio” needs a deeper look.
  • The relationship of LoanStanding with the other variables is as expected, with Bad Loan Standing associated with high BorrowerAPR, low CreditScore, low LoanOriginalAmount and low EmploymentStatusDuration.

Plot Matrix of the non-categorical variables and ProsperRiskRating

  • There are no strong correlations between the variables.
  • There is a moderate negative relationship between CreditScore and BorrowerAPR, which is expected.
  • The relationship of ProsperRiskRating with other variables is expected, with High Risk associated with low LoanOriginalAmount, high BorrowerAPR, high DebtToIncomeRatio and low CreditScore.
  • An interesting find is that the mean EmploymentStatusDuration is very close for High, Medium and Low ProsperRiskRating, with less employment duration having less risk.

Plot Matrix of the categorical variables and ProsperRiskRating

  • The relationships between each other are not obvious except with our features of interest - LoanStanding and ProsperRiskRating.

BorrowerAPR by CreditScore Analysis

## [1] -0.4297073
  • In the first plot, we can see that most of the data is concentrated above credit score 500.
  • In the second plot, focusing on the majority of data (99%), we can see that the BorrowerAPR is lower at higher CreditScore.
  • In the mean and median plot, we see that there is a negative correlation between BorrowerAPR and CreditScore (increasing CreditScore = lower BorrowerAPR).
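
A correlation coefficient like the one printed above can be obtained with `cor()`. The sketch below uses an invented six-row data frame (`toy`), so it will not reproduce the reported value; `use = "complete.obs"` is an assumption about how missing credit scores were handled.

```r
# Invented values; only the column names mirror the dataset.
toy <- data.frame(
  CreditScore = c(609.5, 649.5, 689.5, 729.5, 769.5, NA),
  BorrowerAPR = c(0.34, 0.29, 0.25, 0.18, 0.12, 0.20)
)

# Pearson correlation, dropping rows where either value is missing.
r <- cor(toy$CreditScore, toy$BorrowerAPR, use = "complete.obs")

# Mean APR at each credit score, the kind of summary behind a mean line plot.
# Rows with a missing CreditScore are dropped from the grouping.
mean_apr_by_score <- tapply(toy$BorrowerAPR, toy$CreditScore, mean)

print(r)
print(mean_apr_by_score)
```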

Loan Original Amount by Credit Score Analysis

## [1] 0.3408745
  • There is a moderate positive correlation (about 0.34) between CreditScore and LoanOriginalAmount.
  • Looking at the line graph, we see an initial plateau (low credit score = less loan amount), a linear increase (increasing credit score = increasing loan amount), a plateau with occasional peaks (increasing credit score = almost similar loan amounts, with occasional high loan amounts) and a linear increase at the end (increasing credit score = increasing loan amount).

DebtToIncomeRatio by CreditScore Analysis

## [1] -0.01316852
  • The distribution of data does not look linear in the first plot.
  • Looking at the mean and median plots, the data distribution is almost a parabola.
  • There appear to be three regions in the plots.
  • The first for which credit scores are relatively low and debt-to-income ratios are decreasing.
  • The second for which debt-to-income ratios have a peak mean and median but with average credit scores.
  • The third for which debt to income ratios are low but credit scores are good.
  • The two ends of the plot might be associated with the lowest and highest income ranges.

IncomeRange Analysis

  • From the plot we can see that the mean & median CreditScore increases with higher IncomeRange.
  • There seem to be significant outliers in the “$1-24,999” & “$100,000+” categories.
  • The range of CreditScore is high in the “$0” category. This could be because of job loss or bankruptcy.
  • With the exception of the ‘$0’ category, the mean and median BorrowerAPR decrease with increasing income, which is expected.
  • There is a significant difference between mean & median of ‘$0’ category and a wide Interquartile range.
  • The mean & median BorrowerAPR for ‘$0’ category is lower than all other categories. This could explain the DebtToIncomeRatio by CreditScore plot.

Listing Category Analysis

  • ‘Debt Consolidation’ has the highest mean & median loan amount.
  • The mean loan amount for ‘Baby&Adoption’ is the second highest.
  • The mean & median BorrowerAPR is highest for ‘Cosmetic Procedure’ followed by ‘Household Expenses’.
  • The lowest mean & median BorrowerAPR is for ‘Personal Loan’ and ‘Not Available’.
  • The interquartile range of CreditScore is widest for ‘Not Available’. The mean & median CreditScore is least for the same.
  • The highest mean & median CreditScore is for ‘Boat’ which is expected.

Term Analysis

  • The 60 month loan term has the highest mean and median loan amounts.
  • The 12 month loan period has the lowest mean and median loan amounts.
  • The mean and median BorrowerAPR seems to be very close for all loan terms.
  • The interquartile range of BorrowerAPR for 60 month term loans is the least.

IsBorrowerHomeowner Analysis

## loan$IsBorrowerHomeowner: False
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    5000    7034   10000   35000 
## -------------------------------------------------------- 
## loan$IsBorrowerHomeowner: True
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    8000    9617   15000   35000

## loan$IsBorrowerHomeowner: False
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   649.5   689.5   675.3   709.5   869.5     591 
## -------------------------------------------------------- 
## loan$IsBorrowerHomeowner: True
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.5   669.5   709.5   714.3   749.5   889.5

## loan$IsBorrowerHomeowner: False
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.16988 0.22362 0.22960 0.29371 0.51229      25 
## -------------------------------------------------------- 
## loan$IsBorrowerHomeowner: True
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01647 0.14494 0.19645 0.20825 0.26762 0.41355

## loan$IsBorrowerHomeowner: False
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.210   0.276   0.310  10.010    5253 
## -------------------------------------------------------- 
## loan$IsBorrowerHomeowner: True
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.150   0.220   0.276   0.320  10.010    3301

## loan$IsBorrowerHomeowner: False
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   18.00   50.00   78.97  108.00  755.00    5052 
## -------------------------------------------------------- 
## loan$IsBorrowerHomeowner: True
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    36.0    84.0   111.5   159.0   733.0    2573

## loan$IsBorrowerHomeowner: False
##             $0      $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999 
##            947           5751          19985          13929           6025 
##      $100,000+           NA's 
##           4700           5122 
## -------------------------------------------------------- 
## loan$IsBorrowerHomeowner: True
##             $0      $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999 
##            480           1523          12207          17121          10891 
##      $100,000+           NA's 
##          12637           2619
  • The mean and median LoanOriginalAmount is higher (2583, 3000 respectively) for Homeowners which shows an expected relationship between the two.
  • The mean and median CreditScore is higher (39, 20 respectively) for Homeowners.
  • The mean and median BorrowerAPR is slightly lower (0.021, 0.027) for Homeowners.
  • The median DebtToIncomeRatio is marginally lower (0.01) for non-homeowners and the mean is the same. The mean and median are far apart, which points to a lot of outliers.
  • Looking at the plot we can see that there is no significant relationship between DebtToIncomeRatio and IsBorrowerHomeowner. I will not be considering this relationship for future analysis.
  • The mean and median EmploymentStatusDuration is higher (33, 34) for Homeowners.
  • In the IncomeRange by HomeOwner plot we can see that a significant number of non-homeowners have an income below $75,000. A significantly higher number of Homeowners have an income higher than $50,000.
  • In conclusion, the analysis shows that Homeowners have longer employment duration, higher CreditScore, Lower APR and make more money than non Homeowners. I expect a significant relationship between IsBorrowerHomeowner and LoanStanding.
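
Grouped summaries in the format shown above are what `by()` combined with `summary()` prints. The sketch below is a minimal example on invented loan amounts (the `toy` data frame is hypothetical), followed by the median difference between the two groups.

```r
# Invented values; only the column names mirror the dataset.
toy <- data.frame(
  IsBorrowerHomeowner = c("False", "False", "False", "True", "True", "True"),
  LoanOriginalAmount  = c(1000, 5000, 10000, 4000, 8000, 15000)
)

# Five-number summary plus mean for each homeowner group.
amount_by_owner <- by(toy$LoanOriginalAmount, toy$IsBorrowerHomeowner, summary)
print(amount_by_owner)

# Difference in medians (homeowners minus non-homeowners).
medians <- tapply(toy$LoanOriginalAmount, toy$IsBorrowerHomeowner, median)
median_gap <- medians[["True"]] - medians[["False"]]
print(median_gap)
```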

Loan Standing Analysis

## loan$LoanStanding: Bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   19.00   53.00   81.67  115.00  755.00    3085 
## -------------------------------------------------------- 
## loan$LoanStanding: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   69.00   98.29  141.00  745.00    4540

## loan$LoanStanding: Bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1400  0.2200  0.3425  0.3400 10.0100    1706 
## -------------------------------------------------------- 
## loan$LoanStanding: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.150   0.220   0.263   0.310  10.010    6848

## loan$LoanStanding: Bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00864 0.18977 0.25627 0.25384 0.31033 0.50633 
## -------------------------------------------------------- 
## loan$LoanStanding: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15016 0.20268 0.21178 0.27246 0.51229      25

## loan$LoanStanding: Bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   609.5   669.5   655.5   709.5   869.5     174 
## -------------------------------------------------------- 
## loan$LoanStanding: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   709.5   703.0   729.5   889.5     417
  • The employment duration is less for bad standing loans. The mean and median are lower than good standing loans.
  • A statistical test for significance is needed to check if the difference in means and medians for good and bad standing loans is significant.
  • The DebtToIncome Ratio plot shows an expected trend of higher DebtToIncome Ratio for bad standing loans.
  • An interesting find is that the median DebtToIncome Ratio for good and bad standing loans is the same.
  • The mean DebtToIncome Ratio for bad standing loans is greater than the 3rd quartile. This shows that there are many outliers with high DebtToIncome ratio.
  • The BorrowerAPR plot also shows an expected trend of higher APR for bad standing loans.
  • Borrowers with a higher APR should have a harder time paying back loans so it is to be expected that their loans will have a higher rate of having a bad standing.
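
The significance check mentioned above was not run in this report. One possible way to do it, sketched here on invented APR values, is a Welch two-sample t-test for the means and a Wilcoxon rank-sum test as a rank-based alternative that is less sensitive to outliers. The `toy` data frame and its values are assumptions for illustration only.

```r
# Invented APR values for eight Bad and eight Good standing loans.
bad  <- c(0.31, 0.28, 0.25, 0.33, 0.22, 0.29, 0.27, 0.35)
good <- c(0.20, 0.18, 0.24, 0.15, 0.21, 0.17, 0.23, 0.19)
toy <- data.frame(
  LoanStanding = factor(rep(c("Bad", "Good"), each = 8)),
  BorrowerAPR  = c(bad, good)
)

# Welch t-test: compares group means without assuming equal variances.
welch <- t.test(BorrowerAPR ~ LoanStanding, data = toy)

# Wilcoxon rank-sum test: compares the groups using ranks.
rank_test <- wilcox.test(BorrowerAPR ~ LoanStanding, data = toy)

print(welch$p.value)
print(rank_test$p.value)
```

With skewed variables such as DebtToIncomeRatio, the rank-based test is probably the safer of the two.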

ProsperRiskRating Analysis

## loan_2009$ProsperRiskRating: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    4000    6651   10000   35000 
## -------------------------------------------------------- 
## loan_2009$ProsperRiskRating: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    5500   10000   11153   15000   35000 
## -------------------------------------------------------- 
## loan_2009$ProsperRiskRating: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    5000   10000   11507   15000   35000

## loan_2009$ProsperRiskRating: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   28.00   71.00   99.86  144.00  755.00       9 
## -------------------------------------------------------- 
## loan_2009$ProsperRiskRating: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0      32      78     107     153     733       8 
## -------------------------------------------------------- 
## loan_2009$ProsperRiskRating: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    31.0    73.0   102.4   145.0   644.0       2

## loan_2009$ProsperRiskRating: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.08496 0.24856 0.29486 0.29020 0.34105 0.42395 
## -------------------------------------------------------- 
## loan_2009$ProsperRiskRating: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07922 0.17061 0.19323 0.19445 0.21945 0.39153 
## -------------------------------------------------------- 
## loan_2009$ProsperRiskRating: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04583 0.09736 0.12528 0.12115 0.14206 0.24807

## loan_2009$ProsperRiskRating: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.170   0.250   0.295   0.350  10.010    5133 
## -------------------------------------------------------- 
## loan_2009$ProsperRiskRating: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1600  0.2200  0.2472  0.3100 10.0100    1580 
## -------------------------------------------------------- 
## loan_2009$ProsperRiskRating: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1200  0.1800  0.1939  0.2400 10.0100     583

## loan_2009$ProsperRiskRating: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   609.5   669.5   689.5   689.5   709.5   869.5 
## -------------------------------------------------------- 
## loan_2009$ProsperRiskRating: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   609.5   689.5   709.5   711.0   729.5   889.5 
## -------------------------------------------------------- 
## loan_2009$ProsperRiskRating: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   649.5   729.5   749.5   755.9   789.5   889.5
  • Only borrowers post July 2009 have a ProsperRiskRating.
  • The mean and median LoanOriginalAmount are significantly lower for High Risk loans.
  • The mean and median LoanOriginalAmount is very close for Medium & Low Risk loans.
  • The mean & median Employment Duration for borrowers is very close to each other for all loans, which is an interesting find. I expected High risk loans to have a significantly lower Employment Duration compared to Low risk loans.
  • The trend in the BorrowerAPR is expected. There is a significant increase in mean and median of BorrowerAPR from Low risk to Medium risk to High Risk.

Bivariate Analysis Summary


How did the features of interest vary with other features in the dataset?

  • The median Employment duration is higher for borrowers of Good Loan Standing. The difference in the mean and median Employment duration is very small (17 and 16 respectively), so I will not consider this feature for further analysis.
  • The mean DebtToIncome Ratio is significantly higher for Bad Loan standing borrowers. The median is the same for both Good and Bad standing.
  • The mean and median for BorrowerAPR is significantly higher for Bad Loan standing borrowers which is expected.
  • The mean and median for CreditScore is lower for Borrowers with Bad Loan Standing. There is a 40 point difference in the median and a 47.5 point difference in the mean CreditScore between Good and Bad Standing loans.
  • A significantly large number of Employed Borrowers have Good Standing loans. I will not be considering this variable for future analyses.
  • The majority of borrowers with Good Standing loans made over $50,000. The majority of borrowers with Bad Standing loans made less than $75,000.
  • A higher number of Homeowners have Good Standing loans. A larger number of non-Homeowners have Bad Standing loans. The difference in the numbers is very small. I will not be considering this feature for future analysis.
  • A higher number of borrowers with Bad Standing loans have High ProsperRiskRating.
  • The loans with low Prosper Rating of ‘HR’, ‘E’ & ‘D’ have comparatively higher Bad LoanStanding borrowers.
  • The mean and median LoanOriginalAmount are significantly lower for High Risk loans. The mean and median LoanOriginalAmount are very close for Medium & Low Risk loans.
  • The mean & median Employment Duration for borrowers is very close to each other for all loans, which is an interesting find. I expected High risk loans to have a significantly lower Employment Duration compared to Low risk loans.
  • The trend in the BorrowerAPR is expected. There is a significant increase in mean and median of BorrowerAPR from Low risk to Medium risk to High Risk.

Did you observe any interesting relationships between the other features?

  • There is a negative correlation between BorrowerAPR and CreditScore (increasing CreditScore = lower BorrowerAPR).
  • In the LoanOriginalAmount by CreditScore plot, we see an initial plateau (low credit score = less loan amount), a linear increase (increasing credit score = increasing loan amount), a plateau with occasional peaks (increasing credit score = almost similar loan amounts, with occasional high loan amounts) and a linear increase at the end (increasing credit score = increasing loan amount).
  • I found a relationship in the DebtToIncomeRatio by CreditScore plot that excluding the outliers, almost looks parabolic (pointing down). There are two regions to the far left and right of this graph for which the debt-to-income-ratio is relatively low but the credit score is low on the left end and high on the right end. My guess is that on the low end, one will find several loans which have borrowers making less than $50,000 a year and on the high end of credit score, one will find several loans which have borrowers making more than $50,000 a year.
  • The range of CreditScore is high in the “$0” IncomeRange category. This could be because of job loss or bankruptcy.
  • The mean loan amount for the ListingCategory ‘Baby&Adoption’ is the second highest after Debt Consolidation. The mean & median BorrowerAPR is highest for ‘Cosmetic Procedure’. The highest mean & median CreditScore is for ‘Boat’.
  • The 60 month loan term has the highest mean and median loan amounts. The 12 month loan period has the lowest mean and median loan amounts. The mean and median BorrowerAPR seems to be very close for all loan terms.
  • The mean and median LoanOriginalAmount are significantly lower for High ProsperRiskRating loans.
  • The mean & median Employment Duration for borrowers is very close to each other for all loans, which is an interesting find. I expected High risk loans to have a significantly lower Employment Duration compared to Low risk loans.
  • The trend in the BorrowerAPR is expected. There is a significant increase in mean and median of BorrowerAPR from Low risk to Medium risk to High Risk.

What was the strongest relationship you found?

  • Among the numerical variables, the strongest relationship with the features of interest is for BorrowerAPR, followed by CreditScore.
  • Among the categorical variables, the strongest relationship I found is between IncomeRange, LoanStanding and ProsperRiskRating.

Final Plots


Multivariate Plot 1

  • From the scatterplot and the jitter plot, we can see that the lower credit scores are dominated by borrowers falling in the “$0”-“$50,000” categories. The same borrowers have a higher debt-to-income ratio.
  • From the jitter plot, we can see that the higher credit scores are dominated by borrowers falling above the “$75,000” category. The same borrowers are concentrated at lower debt-to-income ratios.
  • The faceted plot shows that the majority of borrowers in the Bad standing facet have an income range lower than “$50,000”.
  • The majority of borrowers in the Good standing facet have an income range higher than “$50,000”.
  • The ProsperRiskRating facet follows a similar trend to Loan Standing. We see a majority of borrowers with income range lower than “$50,000” in the High Risk facet. A significant majority of borrowers in the Low risk facet have income more than “$75,000”. The Medium risk facet is dominated by borrowers with income range higher than “$50,000”.
  • The overall trend of the plots shows that a higher proportion of borrowers with IncomeRange lower than “$50,000” have high DebtToIncomeRatio, lower CreditScore, High ProsperRiskRating and Bad Standing loans (they dominate the upper left part of the plots).
  • A larger proportion of borrowers with IncomeRange higher than “$50,000” have lower DebtToIncomeRatio, higher CreditScore, Low to Medium ProsperRiskRating and Good Standing loans (they dominate the lower right part of the plots).

Multivariate Plot 2

  • From the first jitter plot we can see that the left top is dominated by borrowers with income range less than “$50,000” and that the right bottom is dominated by borrowers with income range higher than “$75,000”.
  • Faceting by ProsperRiskRating, we see that the majority of High risk loans have high APR and the borrower income range is less than “$50,000”. Medium risk rating has loans with comparatively lower APR and the borrower income range seems to be well distributed above “$25,000”. Low risk rating loans have a significantly lower APR and the borrower income range is dominated by incomes above “$75,000”.
  • Faceting by Loan Standing, we see that a high proportion of borrowers with bad standing loans have an income lower than “$50,000” with higher APR loans. Good standing loans have a similar distribution of loan APR and CreditScore variance, but there is a higher proportion with income range above “$50,000”.

Multivariate Plot 3

  • We can see from the graph that homeowners have higher median loan amounts than people who do not own a home in all states.
  • DC has the highest Median LoanOriginalAmount.
  • The median LoanOriginalAmount is lower for bad loans in all the states except in NH (where it is equal).
  • By faceting with ProsperRiskRating, we can see that there is a significant decrease in median LoanOriginalAmount for High Risk loans in all States.
  • The median LoanOriginalAmount for Medium and Low ProsperRiskRating Loans is almost similar.
  • This is expected because Prosper assigns very few loans with Low Risk (Prosper prefers a High Risk loan being in Good Standing to a Low Risk loan going into Bad Standing.)
  • IsBorrowerHomeowner is definitely a motivator for people to get a loan.
  • Owning a home means that the borrower has an existing approved loan which is a good indicator for lenders. Therefore Homeowners having a higher loan original amount and being in good loan standing is expected.

Were there features that strengthened each other in terms of looking at your features of interest?

  • IncomeRange and CreditScore strengthened each other in terms of predicting LoanStanding & ProsperRiskRating. From the BorrowerAPR vs CreditScore plot & DebtToIncomeRatio vs CreditScore plot, we can see that there were very few loans in a status of Bad LoanStanding & High ProsperRiskRating which had credit scores beyond 800 and also had an income above $50,000. There were also very few borrowers with CreditScore below 600 and with IncomeRange below $50,000 in the Good LoanStanding with Low ProsperRiskRating.
  • For DebtToIncomeRatio, we can see that in borrowers making less than $50,000, the proportion of them who had Good & Bad LoanStanding with High & Medium ProsperRiskRating and had a DebtToIncomeRatio greater than 0.25 was far greater than that of borrowers making greater than $50,000 (most of the borrowers making greater than $50,000 were in Good LoanStanding with Medium & Low ProsperRiskRating category and had ratios less than 0.25). This was across all credit scores. The same could be said for BorrowerAPR but the difference in number of borrowers will be smaller between those making less than $50,000 and those making greater than $50,000.
  • IsBorrowerHomeowner is definitely a motivator for people to get a loan. Owning a home means that the borrower has an existing approved loan which is a good indicator for lenders. Therefore Homeowners have higher median LoanOriginalAmount Loans in Good LoanStanding.
  • In conclusion, IncomeRange and DebtToIncomeRatio strengthen each other more than IncomeRange and BorrowerAPR in terms of predicting LoanStanding and ProsperRiskRating. IsBorrowerHomeowner and LoanOriginalAmount also strengthen each other significantly in terms of predicting LoanStanding and ProsperRiskRating.

Were there any interesting or surprising interactions between features?

  • I was surprised that the relationship of BorrowerAPR-IncomeRange was not as strong as DebtToIncomeRatio-IncomeRange. I expected the opposite would be true based off the bivariate box plots.
  • The significant difference in Median LoanOriginalAmount across all States between homeowners and non homeowners is also an interesting find.

Final Plot

  • I will check the relationship between LoanStanding and ProsperRiskRating. I will deconstruct the ProsperRiskRating variable to show and confirm the trends I’ve seen in my analysis.

  • The first thing to notice is that significantly fewer loans assigned a rating of ‘AA’ have Bad LoanStanding (Delinquent, Default & Chargedoff loans).
  • Almost all of the ‘AA’ ProsperRating loans have ProsperScore above ‘6’. It appears then that for a loan to have a ProsperRating of ‘AA’ is to have a very low (almost 0%) chance of resulting in Bad Standing (Delinquent, Default & Chargedoff) loans (this is only for this sample data and more data would be needed to confirm this).
  • Starting with a ProsperRating of ‘D’ and continuing until ‘HR’, the share of Bad Standing (Delinquent, Default & Chargedoff) loans relative to all loans seems to stay constant (around 1.25%) but the share of Good Standing (Current & Completed) loans continues to decrease. It should be noted that, within each rating, the proportion of Bad Standing (Delinquent, Default & Chargedoff) loans therefore increases significantly from ‘D’ through ‘HR’. A ProsperRating of ‘C’ has the largest number of Good Standing (Current & Completed) loans, with around an equal number of loans having a ProsperScore between ‘1’-‘6’ and ‘7’-‘11’.
  • The proportion of loans with Good Standing (Current & Completed) loans is greatest for ‘AA’ loans.
  • The number of Bad Standing (Delinquent, Default & Chargedoff) loans increases from ProsperRating of ‘AA’ through ‘HR’ and the number of Good Standing (Current & Completed) loans increases from a rating of ‘AA’ to a rating of ‘C’ but decreases from a rating of ‘D’ through ‘HR’ (note that the proportion of Good Standing (Current & Completed) loans within groups actually increases from ‘AA’ to ‘C’ but decreases from ‘D’ to ‘HR’).
  • Almost all the loans assigned a rating of ‘HR’ had ProsperScore between ‘1’ and ‘5’ but there were some loans assigned ratings of ‘C’ through ‘E’ that also had ProsperScore between ‘8’ and ‘11’ (Most of these loans fell into the Good Standing category but a few fell into the Bad Standing category).
  • For ratings of ‘B’ through ‘AA’, there were some loans assigned scores ‘1’-‘4’, but almost all fell into Good LoanStanding (Current & Completed) category.
  • In terms of ProsperScore and ProsperRating strengthening each other, combinations of very good ProsperScores and very good ProsperRatings (Low ProsperRiskRating) do tend to minimize Bad Standing (Delinquent, Default & Chargedoff) loan statuses but the opposite is not true.
  • Combinations of bad ProsperScores and bad ProsperRatings (High ProsperRiskRating) don’t necessarily minimize Good Standing (Current & Completed) status (rather they maximize statuses of Bad Standing).
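
The bullets above mix two kinds of proportion: a group's share of all loans and its share within one ProsperRating. A small sketch of how the two differ, using `table()` and `prop.table()` on an invented twelve-loan sample (the `toy` data frame and its counts are assumptions, not figures from the dataset):

```r
# Invented sample: four loans in each of three ratings.
toy <- data.frame(
  ProsperRating = factor(rep(c("AA", "C", "HR"), each = 4),
                         levels = c("HR", "C", "AA")),
  LoanStanding  = c("Good", "Good", "Good", "Good",   # AA
                    "Good", "Good", "Good", "Bad",    # C
                    "Good", "Good", "Bad",  "Bad")    # HR
)

counts <- table(toy$ProsperRating, toy$LoanStanding)

# Share of all loans that fall in each rating/standing cell.
overall <- prop.table(counts)

# Share of Good and Bad loans within each rating (each row sums to 1).
within_rating <- prop.table(counts, margin = 1)

print(round(overall, 3))
print(within_rating)
```

In this toy sample the ‘HR’ Bad Standing cell is about 17% of all loans but 50% of ‘HR’ loans, which is the kind of gap that makes the two readings easy to confuse.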

Reflection


  • The Prosper loans dataset contains over 100k observations with 81 variables spanning 9 years.
  • Understanding the variables, terminology and general domain knowledge of financial peer-to-peer lending was the first obstacle in analyzing this dataset.
  • The one ongoing hurdle was determining which variables to analyze, not drifting too far off any one path of investigation and not pulling in new variables throughout the process.
  • Another persistent issue was overplotting on scatterplots; a number of techniques were used across multiple plots to try to mitigate it.

  • However, success was found in many areas. The general analysis revealed areas of interest, such as correlations among CreditScore, BorrowerAPR, DebtToIncomeRatio, IncomeRange, IsBorrowerHomeowner and LoanOriginalAmount, which raised questions about LoanStanding and the difficulty of separating cause and effect. Also, expected trends were confirmed, and unexpected relationships such as BorrowerAPR by IncomeRange & LoanOriginalAmount by ProsperRiskRating were revealed.

  • Some features and relationships which surprised me are:
  • I was surprised that BorrowerAPR didn’t appear to be strengthened as much by IncomeRange as DebtToIncomeRatio.
  • The number of borrowers with an income of “$100,000+” is very high (more than “$75,000-99,999”). This could be because a large number of users with high income consider a peer-to-peer loan.
  • The states of Maine, Iowa & North Dakota do not have any borrowers post July 2009.

  • Having only analyzed 16 of the original 81 variables leaves a lot of undiscovered relationships and trends.
  • Additional data would also enhance this dataset.
  • Having the borrower’s age and sex would allow analysis to possibly discover trends among men & women or young & old.
  • Also, population and state-average-income features would allow for discovery of the type of environment the borrower lived in.
  • For example, a borrower around $75,000 income range living in California could be considered middle class within that state in comparison to a borrower from Iowa in the same income range bracket which might be considered in the upper class.
  • How does your class (lower, middle or upper) determine if you use peer-to-peer loans and/or become a borrower with bad loan standing? This would be dependent on the borrower’s state and the state’s average income range. These types of questions and more could be answered with additional data.

Potential further steps?

In future analyses, I would like to build models with the dataset, using logistic regression to predict LoanStanding using as many variables as possible.
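
As a rough idea of what such a model could look like, the sketch below fits a logistic regression with `glm()` on simulated data. Everything here is an assumption for illustration: the `toy` data frame, the simulated relationship between the predictors and `BadStanding`, and the choice of only two predictors. It says nothing about how the real dataset would behave.

```r
set.seed(42)
n <- 500

# Simulated predictors on roughly the scales seen in this report.
toy <- data.frame(
  BorrowerAPR = runif(n, 0.05, 0.40),
  CreditScore = runif(n, 600, 880)
)

# Invented data-generating process: higher APR and lower credit score
# raise the odds of a loan ending in bad standing.
log_odds <- -1 + 8 * (toy$BorrowerAPR - 0.2) - 0.01 * (toy$CreditScore - 700)
toy$BadStanding <- rbinom(n, 1, plogis(log_odds))

# Logistic regression: binomial family with the default logit link.
fit <- glm(BadStanding ~ BorrowerAPR + CreditScore,
           data = toy, family = binomial)

# Predicted probability of bad standing for each simulated loan.
predicted <- predict(fit, type = "response")

print(coef(fit))
```

On the real data, categorical predictors such as IncomeRange and ProsperRiskRating would enter the formula as factors, and the model would need to be checked on held-out loans before drawing conclusions.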